import pandas as pd # The gold standard of Python data analysis, to create and manipulate tables of data
import numpy as np # The Python package for array processing, on which pandas is built
import matplotlib.pyplot as plt # The gold standard of Python data visualization, but can be complex to use
import seaborn as sns; sns.set() # A package to make Matplotlib visualizations more aesthetic
import branca
import geopandas
import folium # package for making maps, please make sure to use a version older than 1.0.0.
from wordcloud import WordCloud # A package that will allow us to make a wordcloud
from scipy.stats import ttest_ind # SciPy's statistics module--we'll stick to t-tests here
from IPython.display import display
from folium.plugins import TimeSliderChoropleth
# from time_slider_choropleth import TimeSliderChoropleth
%matplotlib inline
plt.rcParams["figure.figsize"] = (8,5)
"I think that's how Chicago got started. A bunch of people in New York said, 'Gee, I'm enjoying the crime and the poverty, but it just isn't cold enough. Let's go west." - Richard Jeni
"Chicago is known for good steaks, expensive stores, and beautiful architecture. Unfortunately, the Windy City also enjoys a reputation for corrupt politics [and] violent crime." - Bob Barr
Exploratory data analysis is one of the most important parts of data science. This case builds upon Case 4.3 to encourage further practice, with a focus on technical implementation. We hope that after this case, students are able to think critically and ask logical questions, as well as have the technical capability to investigate those questions. Technical capability in this case mainly refers to using data visualizations, manipulating DataFrames in pandas, and creating custom Python functions to quantify metrics.
Business Context. Congratulations! You were recently promoted to regional Chief of Strategy of the Chicago Police Department. You have many years of experience with field work, but this is your first time having to think about the bigger picture. Chicago is a large city, and your resources are limited. Thus you need to devise a comprehensive plan to enhance the efficiency of police force deployment to fight crime. Making data-driven decisions is essential, even in law enforcement where prior knowledge usually dominates the decision-making process.
Business Problem. Your main task is to explore the data and identify patterns of crime in Chicago, and come up with strategies to efficiently deploy your workforce to fight crime.
Analytical Context. So, you found a dataset available to the Chicago PD from 2017 with information on crimes committed throughout the city. In this case, we will focus on exploratory analysis to construct some preliminary strategies for police deployment. These strategies can be further consolidated or dismissed using more rigorous statistical analysis. One of the key aspects of this case is that our data contains records of crime incidents for which we often do not have a clearly defined outcome (such as the "severity" of a crime). We will discuss ways of dealing with such data, and how they can be incorporated to draw meaningful conclusions.
The case is structured as follows. We will (1) look at univariate summaries; (2) come up with a preliminary strategy based on these; (3) look at joint distributions and revise our strategy; and finally (4) think about changing strategies depending on our priorities and the severity of the crimes.
Let's read in and view our dataset. This dataset is downloaded from this website. It contains reported crime incidents (with the exception of murders, where data exists for each victim) that occurred in the City of Chicago in 2017.
We began Case 4.3 with a basic exploration of the distribution of the various parameters. Since this dataset is more focused on categorical data, we will start by investigating the various frequencies of each category within each parameter:
df = pd.read_csv('Chicago_crime_data.csv', dtype={'ID': object, 'beat_num': object})
df.head(5)
We can see above that the table contains 22 columns and there are 268,303 records in total. Since each homicide case can have more than one row, the actual number of cases is smaller than 268,303. Below is a brief description of each column:
| Variable name | Variable description | Note |
|---|---|---|
| ID | Unique identifier for the record | Each victim in a single homicide case is assigned to a different ID |
| Case Number | The Chicago Police Department RD (Records Division) number | Unique to the incident. Multiple IDs can share the same Case Number if the incident is a homicide case |
| Date | Date when the incident occurred | Might be a best estimate for some records |
| Block | The partially redacted address where the incident occurred | The redacted address is in the same block as the actual address |
| IUCR | The Illinois Uniform Crime Reporting code | Directly linked to the primary type and the description of the crime. See details here |
| Primary Type | The primary description of the IUCR code | - |
| Description | The secondary description of the IUCR code | - |
| Location Description | Description of the location where the incident occurred | - |
| Arrest | Whether an arrest was made | - |
| Domestic | Whether the incident was domestic-related | Domestic-related definition is based on the Illinois Domestic Violence Act |
| beat_num | The police beat where the incident occurred | Smallest police geographic area - each beat has a dedicated police beat car. See details here |
| District | The police district where the incident occurred | Three to five beats make up a police sector and three sectors make up a police district. See details here |
| Ward | The ward where the incident occurred | Wards are city council districts. See details here |
| Community Area | The community area where the incident occurred | See details here |
| FBI Code | The crime classification as outlined in the FBI's NIBRS | NIBRS stands for National Incident-Based Reporting System. See details here |
| Latitude | The latitude of the location where the incident occurred | This location is shifted from the actual location for partial redaction but falls on the same block |
| Longitude | The longitude of the location where the incident occurred | - |
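The distinction between records and cases noted above can be checked by comparing the row count with the number of distinct Case Numbers. A minimal sketch on toy data (the values are invented; the real notebook would run this on `df`):

```python
import pandas as pd

# Toy frame mimicking the structure: a homicide case with two victims
# shares one Case Number across two IDs (all values here are invented).
toy = pd.DataFrame({
    "ID": ["1", "2", "3", "4"],
    "Case Number": ["HZ100", "HZ100", "HZ200", "HZ300"],
})

n_records = len(toy)                    # rows: one per victim/record
n_cases = toy["Case Number"].nunique()  # distinct incidents
print(n_records, n_cases)               # records exceed cases
```

On the real data, `df['Case Number'].nunique()` would give the actual number of incidents.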
There are quite a few variables, but most are either: (1) identifying information (e.g. ID, IUCR); or (2) too granular to start with (e.g. latitude, longitude). Therefore, we will first focus on the following variables: primary_type, description, location_description (location types), date (time of occurrence) and beat_num (geographic location), which give valuable information without getting too granular too quickly. Our outcome of interest is the number of crime incidents.
Similar to the last EDA case, it makes sense to explore the relationship of primary_type and description with our outcome of interest, crime incidents. However, we cannot repeat the exact process of looking at pairwise correlations between the variables of interest and the outcome: both primary_type and description are categorical variables, so it would not make sense to place them on a scatterplot or to calculate correlations between them.
Luckily, both variables are discrete so we can still count the total number of records which belong to a specific category for each of these two variables using a frequency table. Note that primary_type and description are nested variables, meaning each type of primary_type has its own set of descriptions that do not overlap. If two crimes have different primary types then they cannot have the same description by definition.
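The nesting claim is easy to check mechanically: if the variables are nested, every Description maps back to exactly one Primary Type. A sketch on toy data (categories invented):

```python
import pandas as pd

# Toy data in which each Description belongs to exactly one Primary Type.
toy = pd.DataFrame({
    "Primary Type": ["THEFT", "THEFT", "BATTERY", "BATTERY"],
    "Description": ["OVER $500", "RETAIL THEFT", "SIMPLE", "AGGRAVATED"],
})

# Count how many distinct Primary Types each Description appears under;
# nesting means this count is 1 for every Description.
types_per_desc = toy.groupby("Description")["Primary Type"].nunique()
print((types_per_desc == 1).all())
```

Running the same two lines on `df` would confirm (or refute) nesting for the real data.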
df["Primary Type"].value_counts()[:10].plot(kind='barh')
df["Primary Type"].value_counts()
We can see the most prevalent primary type of crime is theft, followed by battery and criminal damage. More severe types, such as homicide, arson and human trafficking, are very rare. A more detailed description of crime types is listed in the Description column. We can further break down the above frequencies by Description since Primary Type and Description are nested variables. The resulting frequency table is shown below:
Write code using the groupby function to count the number of cases in all combinations of Primary Type and Description. Then sort the results in decreasing order of the number of cases. Based on the results, what are the most prevalent descriptions of theft, battery and criminal damage cases in Chicago?
Answer.
df\
.groupby(["Primary Type","Description"])["ID"]\
.count()\
.reset_index(name="count")\
.sort_values(by="count", ascending = False)\
.reset_index(drop=True)\
.head(20)
There are in fact 310 descriptions in total and listing them all here is not viable. However, we can use a visualization tool known as a word cloud to summarize the prevalent descriptions within each primary type. A word cloud visualizes the words within a collection of texts (in our case, the texts are all Descriptions for a specific primary type) and the size of each word is proportional to how often it appears in the dataset. Below, we construct three word clouds for the top 3 most prevalent primary crime types:
# wordcloud for primary type defined by rank
def wordcloud_crime(df, rank):
    # Filter all crimes with the given Primary Type
    condition = df["Primary Type"] == df["Primary Type"].value_counts().index[rank]
    df_filter = df[condition]
    # Create a string with all descriptions separated by spaces
    text = ' '.join(df_filter['Description'])
    # Define the WordCloud scheme
    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(text)
    # Plot
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
print("Crime type: ", df["Primary Type"].value_counts().index[0])
wordcloud_crime( df, 0 )
From the above wordcloud, it seems that the words "building" and "retail" are strongly linked to theft offenses, indicating that theft likely mainly happened indoors and in malls or retail stores.
Use the code above to generate the wordcloud for battery cases and criminal damage cases. What are the most common words to describe these two types of cases?
print("Crime type: ", df["Primary Type"].value_counts().index[1])
wordcloud_crime( df, 1 )
print("Crime type: ", df["Primary Type"].value_counts().index[2])
wordcloud_crime( df, 2 )
Answer.
Battery is strongly linked to the word "domestic", implying that battery charges usually involved family members. Criminal damage is strongly associated with the words "property" and "vehicle", indicating the targets of most criminal damage cases.
As we have seen with word clouds, it seems that a given type of crime is usually linked with certain types of locations (e.g. at home, in retail stores). Write code to investigate the crime patterns associated with types of crime locations. Based on the results, which types of locations are more likely to have crime?
Answer.
As indicated in the first word cloud, some location types, such as buildings and retail stores, are more likely to have crime. Many crimes are also associated with vehicles and their drivers.
df["Location Description"].value_counts().head(10)
Since Location Description is a discrete variable, we can use the same code as when analyzing Primary Type. Based on the results, we find that streets, residences, apartments and sidewalks account for around 50% of all incidents.
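The combined share of the top locations can be computed directly with `value_counts(normalize=True)`. A minimal sketch on toy data (location labels and counts are invented; the real notebook would use `df["Location Description"]`):

```python
import pandas as pd

# Toy location column (values invented for illustration).
loc = pd.Series(["STREET", "STREET", "RESIDENCE", "APARTMENT",
                 "SIDEWALK", "STORE", "PARK", "STREET"])

shares = loc.value_counts(normalize=True)  # fraction of all incidents per location
top_share = shares.head(4).sum()           # combined share of the top 4 locations
print(round(top_share, 3))
```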
So far, we have seen crime patterns linked with Primary Type and Location Description separately. It makes sense to see whether a certain combination of crime type and location type is prevalent or not. We know that both Primary Type and Location Description are discrete variables. We can therefore use a contingency table (cross table) to summarize the total number of incidents that belong to a specific combination of values of Primary Type and Location Description.
We cannot use the previous code where we analyzed Primary Type and Description together since unlike those two variables, Location Description and Primary Type are NOT nested variables. We can use the function crosstab in pandas to generate the contingency table of two variables.
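As a minimal illustration of `pd.crosstab` before applying it to the real columns, on toy data (categories invented):

```python
import pandas as pd

# Toy incidents: two non-nested categorical variables.
toy = pd.DataFrame({
    "Primary Type": ["THEFT", "THEFT", "BATTERY", "BATTERY", "THEFT"],
    "Location Description": ["STREET", "STORE", "APARTMENT", "STREET", "STREET"],
})

# Each cell counts incidents for one (type, location) combination.
ct = pd.crosstab(toy["Primary Type"], toy["Location Description"])
print(ct)
```

Each row is a Primary Type, each column a Location Description, and each cell the number of incidents in that combination.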
crosstab(var1, var2) generates the contingency table for var1 vs. var2. Use this function to generate a contingency table for Primary Type vs. Location Description where only the top 10 most prevalent crime locations and types are included.
Answer.
df[['Primary Type', 'Location Description']].head(10)
cond1 = df["Location Description"].isin(df["Location Description"].value_counts().index[:10])
cond2 = df["Primary Type"].isin(df["Primary Type"].value_counts().index[:10])
#
df_1 = df[ cond1 & cond2] # When both are true
#
df_r = pd.crosstab(df_1["Primary Type"],df_1["Location Description"])
#
sns.heatmap(df_r, annot=False)
# plt.pcolor(df_r)
# plt.yticks(np.arange(0.5, len(df_r.index), 1), df_r.index)
# plt.xticks(np.arange(0.5, len(df_r.columns), 1), df_r.columns, rotation=90)
df_r
Based on the contingency table above, what are the hot spots for the top 10 most prevalent types of crime? Are they the same or not?
Answer.
According to this contingency table, theft is more likely to occur on streets than in other kinds of locations. Battery is recorded more often in apartments, in residences, on sidewalks, and on streets. Criminal damage also tends to occur more on the street.
How can the above table help you deploy your workforce efficiently?
Answer.
We now move on to investigate the relationship between crime incidents and time; i.e. the Date variable we pointed out early on. Time is one of the most important dimensions for constructing an effective deployment plan. Since we cannot patrol every location 24/7, we must target periods of time with high crime rates. Date gives us a timestamp for each incident, which allows us to count how many incidents happened within a given period of time. Since we have one year's worth of data, we can start with monthly total incidents to see if certain months are crime-prone.
We have covered a few cases dealing with temporal data, and we generally group by different units of time (days, weeks, months) to discover different insights from the data. As we investigate from a temporal perspective, it is important to keep in mind the concept of confounding variables introduced in Case 4.3. An example would be discovering a pattern of June and July having more crime when the underlying factor is actually temperature. If Chicago had a particularly hot September in the future, a careful data scientist would expect more crime then, rather than simply concluding that September always has fewer crimes than June/July.
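The confounding story above can be simulated on toy data (all numbers invented): here, month predicts crime only through temperature, yet monthly averages still peak in summer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 30 toy days per month; temperature follows a seasonal curve peaking in July.
months = np.repeat(np.arange(1, 13), 30)
temp = 10 + 15 * np.sin((months - 1) / 12 * 2 * np.pi - np.pi / 2) \
       + rng.normal(0, 2, months.size)

# Crime depends ONLY on temperature, not on month itself.
crime = 50 + 2 * temp + rng.normal(0, 5, months.size)

toy = pd.DataFrame({"month": months, "temp": temp, "crime": crime})

# Monthly means still show a "summer effect": month is confounded with temperature.
print(toy.groupby("month")[["temp", "crime"]].mean().round(1))
```

A hot September in this toy world would show elevated crime even though "September" itself carries no effect.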
Convert the column Date into datetime type and extract the month for each record. Then use the calculated month with the groupby function to plot the total number of cases in each month.
Answer.
df.Date = pd.to_datetime(df.Date, format = '%m/%d/%y %H:%M')
df['Month'] = pd.DatetimeIndex(df.Date).month
df.groupby('Month').size()\
.plot(color = 'green', marker = 'o').set(ylabel='Crime count')
Which months have relatively higher crime rates? Why?
Answer.
We can see that July and August are the months with the highest crime counts; these months correspond to summer break in Illinois, when the warmer weather puts more people on the streets.
Modify your code for monthly total incidents and instead plot the time series of daily total incidents throughout 2017. Do you still believe that February is the time when crime is least concerning?
Answer.
df['Day'] = df['Date'].dt.day
df_sum_date = df.groupby(['Month', 'Day']).size()
fig, ax = plt.subplots(figsize=(8,6))
for label, dataFrame in df_sum_date.groupby('Month'):
    dataFrame.plot(ax = ax, label=label)
ax.set(xlabel='Day', ylabel='Crime count', xticks=range(31), xticklabels=range(31))
plt.legend(ncol=4)
Therefore, if we want to compare levels of crime across different months, the monthly total is probably not a good metric, since different months have different numbers of days. To resolve this issue, we will normalize the monthly total into a metric that does not depend on the number of days in a month. Normalizing data is very common, and the appropriate method of normalization is generally determined by domain knowledge and the hypothesis-generation process you recently learned in Case 4.3. You want to consistently see how your hypotheses play out in your data by slicing the data in various ways (one being normalizing certain parameters for better comparison).
A natural choice is to divide monthly total by the number of days in a month. The normalized value is indeed the average daily incidents in a month, which can be compared across different months. Let's take a look at the results if we use this normalization. From this, it is clear that March is in fact the least concerning month:
res = df.groupby(["Month"])["ID"].count().reset_index(name="count")
res["count"] = res["count"]/[31,28,31,30,31,30,31,31,30,31,30,31]
_ = res.plot(x = "Month", y = "count", title = "2017 monthly crime patterns (normalized)")
_ = plt.ylabel("Average daily total incidents")
Choose the correct normalization approach and modify the code above to visualize the crime patterns in every day of a week. Which day in a week has the largest amount of cases?
Answer.
wdNames = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
df['WeekDay'] = df['Date'].dt.dayofweek.astype('category').cat.rename_categories(wdNames)
resDF = df.groupby('WeekDay')['ID'].count().reset_index(name='Count')
resDF
# These are the number of crimes on each day of the week, divided by the number of times that weekday occurs in the year
# This metric allows us to assess the average daily number of crimes for each day of the week
resDF['Count1'] = resDF['Count']/pd.date_range("2017-1-1","2017-12-31",freq="D").dayofweek.value_counts()[::-1]
resDF
_ = resDF.plot(x = 'WeekDay', y='Count1', title='2017 Crime patterns in a week', marker='o')
_ = plt.ylabel("Average daily total incidents")
Another important dimension we need to consider is the relationship between crime incidents and geographic location. The technical aspect of graphing below can be seen as an extension on what we have previously learned about data visualizations. We recommend you slowly review the code in your spare time until you are able to replicate it on a new problem. How we choose to set up these graphs should always be logical and in this case, we are categorizing the data by police beats.
We have the rough geographic coordinate of each incident and based on these, we can explore the geographic patterns of crime in Chicago. To identify geographic hot spots of crimes, we can partition the City of Chicago into non-overlapping regions and count the total number of cases in 2017 in each region. In this case, we divide Chicago by police beats. We then visualize the results on the map:
# format the beat variable to have leading zeros, count by beat
df["beat_num"] = df["beat_num"].str.zfill(4)
# print(df['beat_num']) #Beat Num is a code with 4 numbers
beat_cn = df.groupby("beat_num")["ID"].count().reset_index(name="crime_count")
# print(beat_cn.head(6))
#
# color scheme
min_cn, max_cn = beat_cn['crime_count'].quantile([0.01,0.99]).apply(round, 2)
# print(max_cn)
# The 1st and 99th percentiles of crimes committed in each beat correspond to 290 and 2225 crimes/beat respectively
#
# These commands will create a scale with colors for each value of crimes/beat
colormap = branca.colormap.LinearColormap(
colors=['white','yellow','orange','red','darkred'],
#index=beat_cn['count'].quantile([0.2,0.4,0.6,0.8]),
vmin=min_cn,
vmax=max_cn
)
colormap.caption="Total crimes in Chicago by police beats"
# load the shape file for Chicago police beats
beat_orig = geopandas.read_file("Boundaries_beat.geojson", driver = "GeoJSON")
# print(beat_orig.head(5))
# JOIN beat data and beat polygons
beat_data = beat_orig.join(beat_cn.set_index("beat_num"), how = "left", on = "beat_num")
# print(beat_data.head(5))
beat_data.fillna(0, inplace = True)
# interactive visualization for beat-specific crime rate in 2017
m_crime = folium.Map(location=[41.88, -87.63],
zoom_start=12,
tiles="OpenStreetMap")
style_function = lambda x: {
'fillColor': colormap(x['properties']['crime_count']),
'color': 'black',
'weight':1,
'fillOpacity':0.5
}
stategeo = folium.GeoJson(
beat_data.to_json(),
name='Chicago beats',
style_function=style_function,
tooltip=folium.GeoJsonTooltip(
fields=['beat_num', 'crime_count'],
aliases=['Beat', 'Total crime'],
localize=True
)
).add_to(m_crime)
colormap.add_to(m_crime)
m_crime
Overall, we find there are three hot spots in Chicago: Downtown Chicago, West Chicago, and South Chicago. You can hover over each region to see the beat number and the total number of crimes in the beat.
Based on the analysis so far, what are your preliminary strategies for police deployment based on times, locations and types of crimes? What is the potential business problem you are solving here?
Answer.
Crime tends to occur more on Fridays than on other days of the week, and more during the summer months (July and August). Geographically, there are more crimes in Downtown Chicago, West Chicago, and South Chicago, so more police should be deployed to those locations at those times.
What is the main shortcoming of our analysis and recommendations in 4.1?
Answer.
The main shortcoming is that we are only looking at one variable at a time; there could be interactions among the variables.
In Exercise 7.2, we cited two potential problems with naively tacking together the patterns we noticed for each individual variable into a recommendation. We tackle the first issue to start: there may exist interaction effects among the variables of interest. Similar to the previous EDA case, we now investigate each potential interaction effect in more detail and challenge our hypotheses.
Again, we can use a contingency table to answer this question just like we did for Primary Type and Location Description. One catch here is that you need to normalize the data so that the comparisons are fair across different days of the week:
res_raw = pd.crosstab(df["Primary Type"], df.WeekDay)
wkDayNum = pd.date_range("2017-1-1","2017-12-31",freq="D").dayofweek.value_counts().tolist()[::-1]
print(wkDayNum)
sns.heatmap(res_raw/wkDayNum)
res_raw/wkDayNum
sns.heatmap(np.transpose(np.transpose(res_raw)/df.groupby(['Primary Type'])['ID'].count())/wkDayNum)
Your colleague claims that most thefts happen on Mondays, Tuesdays, or Wednesdays. How do you validate or disprove their claim based on the data and the table above? What can you say about the days which have the most battery and assault incidents?
Answer.
resDF1 = (res_raw.loc[['THEFT', 'ASSAULT', 'BATTERY'],:]/wkDayNum)\
.rename(columns=str)\
.reset_index()\
.melt(id_vars=['Primary Type'], value_vars=wdNames)
print(resDF1.head(5))
#
fig, ax = plt.subplots(figsize=(8,6))
for label, dataFrame in resDF1.groupby('Primary Type'):
    dataFrame.plot(x='WeekDay', y='value', marker='o', ax = ax, label=label)
ax.set(xlabel='WeekDay', ylabel='Average daily crime count', xticks=range(7), xticklabels=wdNames)
plt.legend(loc=1)
print('')
According to the data, theft tends to occur more on Fridays than on other days of the week, which disproves the claim. It is possible that on Fridays there are more people on the street, which influences the theft rate on that particular day.
Battery tends to occur more on weekends, possibly because people have more leisure time on those days and interpersonal conflicts arise within homes. Assault, on the other hand, is reported in roughly the same proportion on each day of the week; there is no clear pattern.
The next potential interaction is between crime time and crime location. The geographic hot spots might shift from time to time and targeting different regions at different times is a natural strategy to increase efficiency. The following map shows how the crime rate varies geographically over time. Here, the outcome of interest is average daily total incidents:
def folium_slider(beat_cn, beat_orig, tmp_drange, index_var, index_lab,
                  value_var="crime_count", caption="Crimes in Chicago"):
    # get colorbar
    min_cn, max_cn = beat_cn[value_var].quantile([0.01,0.99]).apply(round, 2)
    colormap = branca.colormap.LinearColormap(
        colors=['white','yellow','orange','red','darkred'],
        #index=beat_cn['count'].quantile([0.2,0.4,0.6,0.8]),
        vmin=min_cn,
        vmax=max_cn
    )
    colormap.caption = caption
    # get styledata for folium
    styledata = {}
    for beat in range(beat_orig.shape[0]):
        res_beat = beat_cn[beat_cn.beat_num==beat_orig.iloc[beat,:].beat_num]
        # fill missing values with zero: no recorded crime in that period
        c_count = res_beat.set_index(index_var)[value_var].reindex(tmp_drange).fillna(0)
        df_tmp = pd.DataFrame(
            {'color': [colormap(count) for count in c_count], 'opacity': 0.5},
            index=index_lab
        )
        styledata[str(beat)] = df_tmp
    styledict = {
        str(beat): data.to_dict(orient='index') for
        beat, data in styledata.items()
    }
    # plot map and time slider
    m = folium.Map(location=[41.88, -87.63],
                   zoom_start=12,
                   tiles="OpenStreetMap")
    g = TimeSliderChoropleth(
        beat_orig.to_json(),
        styledict=styledict
    ).add_to(m)
    folium.GeoJson(beat_orig.to_json(), style_function=lambda x: {
        'color': 'black',
        'weight': 0.8,
        'fillOpacity': 0
    }, tooltip=folium.GeoJsonTooltip(
        fields=['beat_num'],
        aliases=['Beat'],
        localize=True
    )).add_to(m)
    colormap.add_to(m)
    return m
# cycle in a year
df.rename(columns={'Month':'month', 'Day':'day','WeekDay':'dayofweek'}, inplace=True)
#
beat_cn_month = df.groupby(["beat_num","month"])["ID"].count().reset_index(name = "crime_count")
nd = pd.DataFrame({"month":range(1,13), "days":[31,28,31,30,31,30,31,31,30,31,30,31]})
beat_cn_month = beat_cn_month.merge(nd, how = "left", on = "month")
beat_cn_month["crime_count"] = beat_cn_month["crime_count"]/beat_cn_month["days"]
folium_slider( beat_cn_month, beat_orig, list(range(1,13)), "month",
list(pd.date_range( "2017-1", "2017-12", freq = "MS").strftime("%Y-%m")),
caption = "Average daily total incidents in a month")
From the plot above, what patterns do you observe over time in Downtown Chicago, as well as the West and South Sides of Chicago? Based on this, do we need to refine the strategies we outlined in Exercise 7.1?
Answer.
Our strategies developed in the previous section suggest that we should pay attention to Fridays as well as the months from May to August. However, is Friday always the most crime-prevalent day of the week regardless of the month we are looking at? Using the code we have developed above, we plot the crime patterns in a week for every month in 2017. Note that normalization is still required here. Based on the results, is our strategy in Exercise 7.1 to focus on Friday valid?
Answer.
res_md = df.groupby(['dayofweek','month'])['ID'].count().reset_index(name="count")
# normalization
date_2017 = pd.DataFrame(
{"dayofweek": pd.date_range("2017-1-1","2017-12-31",freq="D").dayofweek.astype("category"),
"month": pd.date_range("2017-1-1","2017-12-31",freq="D").month } )
date_2017["dayofweek"] = date_2017["dayofweek"].cat.rename_categories(["Mon","Tue","Wed","Thu","Fri","Sat","Sun"])
nd_2017 = date_2017.groupby(['month'])['dayofweek'].value_counts().sort_index().reset_index(name="day_count")
res_md_norm = nd_2017.merge( res_md, how = "left", on = ["month","dayofweek"]).fillna(0)
res_md_norm['count_norm'] = res_md_norm['count']/res_md_norm['day_count']
res_md_norm['dayofweek'] = res_md_norm['dayofweek'].astype("category").cat.reorder_categories(["Mon","Tue","Wed","Thu","Fri","Sat","Sun"])
res_md_norm['month'] = res_md_norm['month'].astype('category').cat.rename_categories(["Jan","Feb","Mar","Apr","May","Jun","July","Aug","Sep","Oct","Nov","Dec"])
mp = sns.lineplot(data=res_md_norm, x='dayofweek', hue = 'month', y='count_norm',
palette = sns.color_palette("hls",12), marker='o')
mp = mp.legend(loc='center left', bbox_to_anchor=(1.01, 0.5), ncol=1)
_ = plt.ylabel("Average daily total crimes in a month")
_ = plt.title("Crime patterns across all days in a week in different months")
_ = plt.xticks(range(8))
for i in range(12):
    tmp = res_md_norm[res_md_norm.dayofweek=="Sun"]
    _ = plt.text(6, tmp['count_norm'].iloc[i], tmp['month'].iloc[i])
So far, we have based our analysis on just the total crime rate. But not all crimes are equally harmful. A homicide case would severely affect a neighborhood even after several years and hinder business development in the area. On the other hand, a case of petty theft is usually not as destructive and would be dismissed after several weeks.
We can define a different type of outcome which emphasizes crime types that need to be controlled to the minimal level. These crimes are generally determined by municipal development plans of Chicago. For example, if the government aims to promote tourism, crimes targeted at tourists, such as theft and deceptive activities should be the main focus of the police department.
In our dataset, the column IUCR is a reporting code that partially measures the damage of an incident to the general well-being of the public. Let's use this code to define a new type of outcome which roughly measures the accumulated damage of all crimes in an area over a period of time. The key idea is that we should focus on places and times that are harmed by crimes the most, not necessarily the ones with the highest total crime incidents.
The IUCR column stands for Illinois Uniform Crime Reporting code, and we can find the full key here. The most important takeaway is that as the code number increases, the severity of the associated crimes generally decreases.
df['IUCR'].head()
Some of the IUCR codes have a letter after them (A, P, B, R, C, T, N). An examination of the above link shows that these letters relate to convictions, which isn't too relevant to us. Write code to remove the letters and convert this column to a numeric type. Then use the hist function to visualize the distribution of IUCR, and the describe function to summarize it (mean, variance, etc.). Based on the histogram, do most cases have large IUCR codes (i.e., are not severe)?
Answer.
# df.IUCR
import re
df['ICUR_M'] = df.IUCR.apply(lambda x: int(re.sub('[A-Za-z]','',x)))
df['ICUR_M'].describe()
df.ICUR_M.hist()
We can see that most cases have a raw IUCR score smaller than 2000 and around 30% of cases have a raw IUCR score smaller than 500. This means that most cases in our dataset are considered to be moderately severe.
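The shares quoted above can be recomputed with one-line boolean means on the cleaned numeric codes. A sketch on toy codes (the values here are invented; the real notebook would run the same lines on the numeric IUCR column):

```python
import pandas as pd

# Toy numeric IUCR-style codes (values invented for illustration).
codes = pd.Series([110, 460, 486, 820, 1310, 1320, 2820, 460, 486, 110])

# The mean of a boolean Series is the fraction of True values.
frac_below_2000 = (codes < 2000).mean()  # share of relatively severe codes
frac_below_500 = (codes < 500).mean()    # share of the most severe codes
print(frac_below_2000, frac_below_500)
```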
Given these IUCR scores, how would you integrate them into our crime incidence visualizations so that the final results also represent the severity of crimes in a region?
Answer.
We could create a new variable that multiplies the number of incidents by the IUCR-based severity for each type of crime. That variable would help us assess the overall crime burden along each factor.
# weighted incidence rate
df['ICUR_M'] = df['ICUR_M'].fillna(0)
df["IUCR_num"] = df['ICUR_M'].max() - df['ICUR_M']
beat_cs_month = df.groupby(["beat_num","month"]).aggregate({"IUCR_num":lambda x: sum(x)}).reset_index()
nd = pd.DataFrame({"month":range(1,13), "days":[31,28,31,30,31,30,31,31,30,31,30,31]})
beat_cs_month = beat_cs_month.merge(nd, how = "left", on = "month")
beat_cs_month["severity_tot"] = beat_cs_month["IUCR_num"]/beat_cs_month["days"]
folium_slider( beat_cs_month, beat_orig, list(range(1,13)), "month",
list(pd.date_range( "2017-1", "2017-12", freq = "MS").strftime("%Y-%m")),
value_var = "severity_tot", caption = "Average daily crime severity in Chicago")
Since IUCR was not particularly useful, let's define a different severity metric. Generally, it is good practice to define a metric that aligns with the bigger-picture goal that the city is trying to achieve. For example, if we want to attract more large companies to open up branches in Chicago, we may care a lot about homicide, sexual assault, and arson, but less about gambling, obscenity, and theft. This type of analysis is rather subjective, but with experience in the field we should gain good intuition about how every type of crime should be bucketed. If we want to be more objective, we could include another dataset of crimes which have been classified by the dollar amount in losses caused by those crimes. We start with a subjective scoring system as an example:
# The zip function creates a list of tuples out of two lists
# The first element of each tuple is the crime type from the first list, and the second element is the severity number
severity_10 = zip(['CRIM SEXUAL ASSAULT', 'ARSON', 'HOMICIDE'], [10] * 3)
severity_9 = zip(['BATTERY', 'ASSAULT','ROBBERY', 'BURGLARY'], [9] * 4)
severity_8 = zip(['MOTOR VEHICLE THEFT', 'PUBLIC PEACE VIOLATION', 'CRIMINAL DAMAGE'], [8] * 3)
severity_7 = zip(['CRIMINAL TRESPASS', 'OFFENSE INVOLVING CHILDREN', 'KIDNAPPING'], [7] * 3)
severity_6 = zip(['STALKING', 'PUBLIC INDECENCY'], [6] * 2)
severity_5 = zip(['OTHER OFFENSE', 'HUMAN TRAFFICKING'], [5] * 2)
severity_4 = zip(['DECEPTIVE PRACTICE', 'INTIMIDATION'], [4] * 2)
severity_3 = zip(['INTERFERENCE WITH PUBLIC OFFICER', 'SEX OFFENSE'], [3] * 2)
severity_2 = zip(['NARCOTICS', 'WEAPONS VIOLATION', 'CONCEALED CARRY LICENSE VIOLATION','OBSCENITY'], [2] * 4)
severity_1 = zip(['THEFT','GAMBLING', 'PROSTITUTION', 'LIQUOR LAW VIOLATION', 'NON-CRIMINAL (SUBJECT SPECIFIED)', 'NON-CRIMINAL'], [1] * 6)
# By turning these zipped tuples into a dictionary, we can map each row of the dataset to its severity label
severities = dict()
for s in [severity_1, severity_2, severity_3, severity_4, severity_5, severity_6, severity_7, severity_8, severity_9, severity_10]:
severities.update(dict(s))
# Our last step is mapping the Primary Type column through this dictionary
df['severity_bus'] = df['Primary Type'].map(severities)
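A quick sanity check on this lookup is worthwhile, since any primary type missing from the dictionary ends up as NaN rather than raising an error. The sketch below uses a tiny hypothetical frame and an abbreviated two-entry version of the dictionary:

```python
import pandas as pd

severities = {"HOMICIDE": 10, "THEFT": 1}  # abbreviated stand-in for the full dictionary
toy = pd.DataFrame({"Primary Type": ["THEFT", "HOMICIDE", "RITUALISM"]})

# map() looks each value up in the dictionary; unmapped types become NaN silently
toy["severity_bus"] = toy["Primary Type"].map(severities)
print(toy)
```

On the real data, checking `df['severity_bus'].isna().sum()` after the mapping confirms that every primary type in the dataset was assigned a weight.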
This weighting scheme is suitable for visualizing the average per-case severity, but not the total severity. The reason is that the range of the weights is rather small (1-10) and crimes with high weights are rare. As a result, the weighted sum is very similar to the unweighted sum, so visualizing it provides little additional information. The average per-case severity, on the other hand, is bounded between 1 and 10, so even a small change in the average (e.g., 0.1) can produce a visible change in the fill color. Let's take a look at the updated average per-case severity score:
# Note that we still need to normalize by the number of days in each month
beat_cs_month = df.groupby(["beat_num", "month"]).aggregate(
    severity_bus=("severity_bus", "sum"),
    crime_count=("severity_bus", "count"),
).reset_index()
nd = pd.DataFrame({"month": range(1, 13), "days": [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]})
beat_cs_month = beat_cs_month.merge(nd, how="left", on="month")
beat_cs_month["severity_bus_tot"] = beat_cs_month["severity_bus"] / beat_cs_month["days"]
# Average per-case severity: total severity divided by the number of incidents
beat_cs_month["severity_bus_avg"] = beat_cs_month["severity_bus"] / beat_cs_month["crime_count"]
folium_slider( beat_cs_month, beat_orig, list(range(1,13)), "month",
list(pd.date_range( "2017-1", "2017-12", freq = "MS").strftime("%Y-%m")),
value_var = "severity_bus_avg", caption = "Average crime severity in Chicago")
This map essentially shows the violent parts of Chicago, ignoring theft and putting emphasis on the most violent crimes. Under this prioritization scheme, we see that Downtown Chicago is no longer the worst region; rather, the worst region has now become South Side Chicago. The "safe" North Side is now dotted with some rather violent beats.
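The re-ranking behind this map can be illustrated with two hypothetical beats: a busy beat dominated by theft and a quiet beat with a few violent crimes. The weighted totals still rank the beats the same way raw counts do, but the per-case average flips the ordering (all counts below are made up for illustration):

```python
weights = {"THEFT": 1, "BATTERY": 9, "HOMICIDE": 10}

# Hypothetical monthly counts for two beats
beats = {
    "A": {"THEFT": 100, "BATTERY": 5, "HOMICIDE": 0},  # busy but mostly mild
    "B": {"THEFT": 10, "BATTERY": 5, "HOMICIDE": 2},   # quiet but violent
}

totals = {name: sum(c.values()) for name, c in beats.items()}
weighted = {name: sum(weights[t] * n for t, n in c.items()) for name, c in beats.items()}
averages = {name: weighted[name] / totals[name] for name in beats}

print(totals)    # beat A has far more incidents...
print(weighted)  # ...and still the larger weighted total...
print(averages)  # ...but beat B has the higher per-case severity
```

This is why the per-case average surfaces violent-but-quiet beats that the count-based maps miss.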
We explored Chicago crime records in 2017 to understand crime patterns and proposed preliminary policy deployment strategies based on these patterns. We initially examined the patterns associated with each variable of interest independently. Based on the single variable analyses, we proposed that the police department should put extra forces to work on Fridays between May and August, pay more attention to theft, battery, assault and criminal damage, and deploy more forces in Downtown Chicago and the West and South Sides of Chicago.
We then conducted analyses for three pairs of variables and found that the above strategies are too rigid because they ignore interaction effects. We found that Friday is a weekly hot spot outside of summertime, but during the summer either Saturday becomes the hot spot or no hot spot is present at all. We also found that the time windows of high criminal activity differ across the three geographic hot spots: Downtown has a high rate throughout the year, whereas crime in the other two regions mostly accumulates between April and August. Finally, theft is common in all kinds of locations, whereas other types of crime, such as battery and assault, tend to cluster in a small number of location types.
In the last part of the analysis, we looked at how to customize weights of different crimes based on the business outcome we were optimizing for. We considered a custom severity score, which highlights the extremely violent cases, and found out that many regions in Chicago have low overall crime incident counts but high violent crime rates. Downtown Chicago, on the other hand, did not harbor violent crimes despite its high overall crime rate. This indicates that if eliminating highly violent crimes is the priority, the previously devised strategies should be changed and there should be more emphasis placed on locations like North Side Chicago.
Moving forward, there are many things we can do. Using this dataset, we can explore other pairwise interaction effects. We can also examine if the strategies we proposed here have already been implemented and whether the deployment plan in use now is successful or not. We can also consider more advanced statistical modeling for our dataset so that all variables can be included, and not just the ones we examined. For those that are interested in this topic, a good starting point is here.
In this case, we learned how to perform exploratory data analysis for records of crime. This type of data does not have as clear of an outcome of interest, so we started by assuming that the outcome of interest is the total number of crime incidents. From there, we followed a similar process as in the last EDA case, except:
Finally, we questioned our original assumption that the number of crime incidents was the most important outcome to optimize for. We also looked at how we ought to weight different crimes differently in our analysis based on the particular business problem at hand.
Case 4.3 was focused on EDA for numerical parameters while Case 4.4 was focused on categorical variables. We simply cannot overemphasize the importance of EDA for any data science problem so we recommend you review both cases rigorously. If you are comfortable with the concepts and technical aspects of both cases, you can find a dataset with a mix of categorical and numerical parameters on a platform like Kaggle for further practice. There is no such thing as too much EDA practice.